Conversation

@Danil-Grigorev

@Danil-Grigorev Danil-Grigorev commented May 15, 2020

Working on https://issues.redhat.com/browse/OCPCLOUD-784

  • Re-enable metrics ports for all containers

This PR adds support for reporting the following Prometheus metrics, and also starts the controller-runtime metrics server to make them available to Prometheus servers:

  • mapi_instance_create_failed: Total count of "create" cloud api errors
  • mapi_instance_update_failed: Total count of "update" cloud api errors
  • mapi_instance_delete_failed: Total count of "delete" cloud api errors

Labels on these metrics (for vSphere):

prometheus.Labels{
  "name": machine.Name,
  "namespace": machine.Namespace,
  "reason": error.Reason,
}

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 15, 2020
@openshift-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@Danil-Grigorev Danil-Grigorev marked this pull request as ready for review May 15, 2020 09:30
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 15, 2020
Copy link
Contributor

@JoelSpeed JoelSpeed left a comment


Added some comments as I read through; let me know if you have any questions about them.

prometheus.CounterOpts{
	Name: "failed_machine_sets_total",
	Help: "Number of times machine set provisioning has failed.",
}, []string{"name", "namespace", "timestamp", "reason"},
Contributor

I'm not sure we should include the timestamp, does it add any extra information for the user do you think?

Author

Does not look like it. The only use would be as an additional unique key for tracking successes or failures of identically named machine sets, which sounds like too much.

Contributor

Prometheus will allow you to see this anyway, as it will track the value over time. So you can use a query to work out when the value changed and get a spike on the graph when it was incremented. Good thought though! Definitely the right line of thinking

prometheus.CounterOpts{
	Name: "succeeded_machine_sets_total",
	Help: "Number of times machine set provisioning has succeeded.",
}, []string{"name", "namespace", "timestamp"},
Contributor

Similarly, don't think the timestamp adds anything?

}, []string{"name", "namespace", "timestamp", "reason"},
)

// SucceededMachineSetProvisionCount calculates the number of success provisioning for a machine
Contributor

Suggested change
- // SucceededMachineSetProvisionCount calculates the number of success provisioning for a machine
+ // SucceededMachineSetProvisionCount calculates the number of success provisioning for a MachineSet

@Danil-Grigorev
Author

/retest

1 similar comment
@Danil-Grigorev
Author

/retest

@Danil-Grigorev Danil-Grigorev marked this pull request as draft May 18, 2020 22:00
@Danil-Grigorev Danil-Grigorev force-pushed the re-enable-metrics branch 3 times, most recently from 8ad2615 to d85718a Compare May 22, 2020 09:30
Contributor

@JoelSpeed JoelSpeed left a comment

I realise there's some tidy-up to do; I've added comments where there is so we don't forget it before merging. Added a couple of suggestions, but otherwise this is looking pretty good.

@Danil-Grigorev Danil-Grigorev marked this pull request as ready for review May 22, 2020 10:17
@Danil-Grigorev Danil-Grigorev force-pushed the re-enable-metrics branch 2 times, most recently from 9430cb2 to 37c97ca Compare May 22, 2020 10:33
@Danil-Grigorev Danil-Grigorev requested a review from JoelSpeed May 22, 2020 10:38
@Danil-Grigorev Danil-Grigorev changed the title Re-enable metrics, added reporting for machineset operations Re-enable metrics, add metrics for vSphere May 22, 2020
@JoelSpeed
Contributor

/approve

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 22, 2020
}

const (
VSphereProvider string = "vsphere"
Member

why do we need this? The platform is implicit in any running cluster.

Author

I was assuming the failure rates would be collected across clusters and aggregated by that value in the longer term?

Member

These metrics are only exposed at each individual cluster's scope. If anything were to aggregate them in the future, we can consider introducing any further suitable labels then.

Contributor

Would we not want to have this label if we were bringing telemetry back to RH about our controllers?

Author

That's my idea too. It may be better to embed the provider-name resolution into MAO, so it identifies the provider from the infrastructure config and sets it before sending metrics, @enxebre @JoelSpeed?

Member

@enxebre enxebre Jun 5, 2020

Would we not want to have this label if we were bring telemetry back to RH about our controllers?

If we do, we can discuss it when there's a proposal for exposing machine API metrics for telemetry. As of today this property does not provide any value for the consumers of these metrics, i.e. users owning the cluster. Most of the information around machine usage is already exposed as metrics by MAO: you can get lists of failed machines, and we could possibly group by error message. This is complementary.

k8s-app: machine-api-operator
sessionAffinity: None
---
apiVersion: v1
Member

where is the serviceMonitor that goes with this?

Author

It can’t be merged until the provider integration PRs get into their repos, or the CI would fail as the machine metrics won’t be accessible. The ServiceMonitor is therefore in #609.
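
For context, a ServiceMonitor pairing with a Service like the one above typically looks roughly like this (the name and port here are illustrative assumptions, not the actual manifest from #609):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: machine-api-controllers     # hypothetical name
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      k8s-app: machine-api-operator
  endpoints:
    - port: metrics                 # must match a named port on the Service
```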

Member

or the CI would fail as the machine metrics won’t be accessible

Can you elaborate? Can you point me to the origin test that will fail?

Member

why would this fire an alert?

Author

@Danil-Grigorev Danil-Grigorev Jun 5, 2020

It's not the service which is triggering it, but the ServiceMonitor from the second PR, #609. I included the merge order in the Jira issue and posted it in Slack too. While the provider PRs are not merged, the metrics port is not served from the code, so while Prometheus is trying to connect, the machine metrics endpoints respond with 502. That results in an alert in the openshift-monitoring namespace. Joel and I already discussed the issue in Slack and decided to split the PR instead of revendoring the MAO PR branch in every provider.

Member

@enxebre enxebre Jun 5, 2020

I thought Prometheus was not triggering an alert for that scenario. This would need to account for OpenStack and baremetal as well.
There's nothing stopping us from including the ServiceMonitor for the machineSet and MHC controllers in this PR and watching it work, right?

Author

I’ve opened issues in both repos to make sure this won’t be left unanswered.

Including machineSet and MHC in the metrics should not be disruptive.

@Danil-Grigorev Danil-Grigorev force-pushed the re-enable-metrics branch 2 times, most recently from 72ea3b8 to 08bd94a Compare June 5, 2020 11:47
@Danil-Grigorev Danil-Grigorev requested a review from enxebre June 5, 2020 11:50
@enxebre
Member

enxebre commented Jun 5, 2020

Can we please update the description of the PR to reflect that this is enabling the plumbing for metrics for the machineSet and MHC controllers, and the structure being used with service/serviceMonitor and rbac-proxies?
Can we please also include that in the commit description, i.e. git ci -m"" -m"here".
Can we please move the vSphere-specific changes into their own commit?

This is looking great, though generally speaking the smaller the PR, the less friction you'll find. We could have got this working e2e for MHC and machineSet first and then enabled the providers.

}
if moTask.Info.State == types.TaskInfoStateError {
	metrics.RegisterFailedInstanceCreate(&metrics.MachineLabels{
		Name: r.machine.Name,
Member

How are these metrics actually being exposed through the controller metrics server?
Wouldn't this need a metrics.Registry.MustRegister(failedInstanceCreateCount) call or something?
https://github.com/kubernetes-sigs/controller-runtime/blob/c0438568a706ec61de31b92f4d76e7fb7e1007b9/pkg/internal/controller/metrics/metrics.go#L50

@Danil-Grigorev Danil-Grigorev requested a review from enxebre June 10, 2020 18:33
@Danil-Grigorev Danil-Grigorev force-pushed the re-enable-metrics branch 2 times, most recently from 71dc77b to 028f0e9 Compare June 11, 2020 12:26
Contributor

@elmiko elmiko left a comment

@Danil-Grigorev we just merged the metrics doc for mao, would you mind adding something about these new metrics to that doc as well?

https://github.com/openshift/machine-api-operator/blob/master/docs/dev/metrics.md

@Danil-Grigorev
Author

/test unit

1 similar comment
@Danil-Grigorev
Author

/test unit

@JoelSpeed
Contributor

/test unit
/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 17, 2020
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@Danil-Grigorev
Author

/retest

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@Danil-Grigorev
Author

/retest

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci-robot
Contributor

@Danil-Grigorev: The following test failed, say /retest to rerun all failed tests:

Test name: ci/prow/e2e-azure-operator
Commit: eb4b161
Rerun command: /test e2e-azure-operator

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 4c9abf9 into openshift:master Jun 17, 2020

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. release/4.6

8 participants